6 research outputs found

    A tree based keyphrase extraction technique for academic literature

    Automatic keyphrase extraction techniques aim to extract quality keyphrases that summarize a document at a high level. Among existing techniques, some are domain-specific and require application domain knowledge, some are based on higher-order statistical methods and are computationally expensive, and some require large amounts of training data, which are rare for many applications. To overcome these issues, this thesis proposes a new unsupervised automatic keyphrase extraction technique, named TeKET (Tree-based Keyphrase Extraction Technique), which is domain-independent, employs limited statistical knowledge, and requires no training data. The proposed technique also introduces a new variant of the binary tree, called the KeyPhrase Extraction (KePhEx) tree, to extract final keyphrases from candidate keyphrases. Depending on the candidate keyphrases, the KePhEx tree structure is expanded, shrunk, or maintained. In addition, a measure called the Cohesiveness Index (CI) is derived, which denotes the degree of cohesiveness of a given node with respect to the root. CI is used to extract final keyphrases from a resultant tree in a flexible manner and, alongside Term Frequency, to rank keyphrases. The effectiveness of the proposed technique is evaluated experimentally on a benchmark corpus, SemEval-2010, with a total of 244 training and test articles, and compared with other relevant unsupervised techniques, with representatives drawn from both statistical techniques (such as Term Frequency-Inverse Document Frequency and YAKE) and graph-based techniques (PositionRank, CollabRank (SingleRank), TopicRank, and MultipartiteRank). Three evaluation metrics, namely precision, recall, and F1 score, are considered in the experiments. The results demonstrate improved performance of the proposed technique over other similar techniques in terms of precision, recall, and F1 score.
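
The three metrics named in this abstract can be illustrated with a minimal Python sketch. It assumes exact matching after lowercasing and trimming; the function name `precision_recall_f1` and the sample phrases are hypothetical, and the thesis may normalize phrases differently (for example by stemming) before matching.

```python
def precision_recall_f1(extracted, gold):
    """Exact-match precision, recall, and F1 between two keyphrase lists.

    Assumption: phrases are compared after lowercasing and trimming only.
    """
    extracted_set = {phrase.strip().lower() for phrase in extracted}
    gold_set = {phrase.strip().lower() for phrase in gold}
    matched = len(extracted_set & gold_set)

    # Guard against empty inputs so the ratios are always defined.
    precision = matched / len(extracted_set) if extracted_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean of precision and recall (0 when nothing matches).
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical example: 2 of 4 extracted phrases appear among 3 gold phrases.
extracted = ["keyphrase extraction", "binary tree", "term frequency", "neural network"]
gold = ["Keyphrase Extraction", "binary tree", "cohesiveness index"]
p, r, f = precision_recall_f1(extracted, gold)
```

Here precision is 2/4 and recall is 2/3, so a technique that extracts fewer but better-targeted phrases raises precision without necessarily changing recall.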

    A geofencing-based recent trends identification from twitter data

    Several techniques have been proposed to relieve users of information overload by finding recent trends on Twitter. However, most of these techniques need to process extensive data. Therefore, this paper proposes a geofencing-based recent trends identification technique, which acquires data based on a geofence. The acquired tweets are then cleaned and their weights are calculated, taking into account the frequency of tweet texts and hashtags along with a boosting factor. The weighted data are then ranked to recommend recent trends to the user. The proposed technique is applied in developing a system using Java and Python. The system is compared with other relevant systems, and its performance is shown to be comparable. Moreover, because the proposed system integrates a geofencing feature, it is preferable to the other systems.
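
The abstract does not give the weighting formula, so the following Python sketch is only one plausible reading of "frequency of tweet texts and hashtags ... along with a boosting factor". The function `rank_trends`, the `hashtag_boost` parameter, the stopword list, and the sample tweets are all hypothetical and not taken from the paper; the tweets are assumed to have already been collected inside a geofence.

```python
from collections import Counter

# Hypothetical stopword list standing in for the paper's cleaning step.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def rank_trends(tweets, hashtag_boost=2.0, top_n=3):
    """Rank terms by frequency, counting hashtags with an extra boost.

    Assumption: a plain word adds 1.0 to its term weight and a hashtag
    adds `hashtag_boost`; the paper's actual formula may differ.
    """
    weights = Counter()
    for tweet in tweets:
        for token in tweet.lower().split():
            token = token.strip(".,!?:;\"'")
            if token.startswith("#") and len(token) > 1:
                weights[token[1:]] += hashtag_boost
            elif token and token not in STOPWORDS:
                weights[token] += 1.0
    return [term for term, _ in weights.most_common(top_n)]


# Hypothetical tweets assumed to fall inside the geofence.
tweets = [
    "Flood warning in Dhaka #flood",
    "Heavy rain and flood in Dhaka",
    "#flood relief needed",
]
trends = rank_trends(tweets, top_n=2)
```

With these sample tweets, "flood" accumulates the highest weight because it appears both as a word and as a boosted hashtag, followed by "dhaka".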

    Processing Research-related Information with Machine learning and Knowledge Graphs

    This dissertation is devoted to machine learning in scientific knowledge graphs, a field concerned with constructing and studying algorithms that can learn from scientific data (especially scholarly documents), as well as with the resulting graphs. The contributions of this dissertation are divided into four parts. An essential input for the automated processing of scholarly documents is their thematic classification, which is often missing from the metadata or is inaccurate or incomplete. In the first part, we classify scholarly documents into thematic categories related to COVID-19. We evaluate several machine learning methods and, in addition to classification correctness, focus on the interpretation of the resulting machine learning models. We also analyze typical classification errors, which may contribute to the further development of the respective methods. A large number of scholarly documents are currently available. In the second part, machine learning methods were investigated for determining the importance of individual documents, which is derived from their citation rate. The dissertation examined several factors that could improve the quality of this prediction, especially the use of thematic categorization of documents. In the experiments, an importance classifier was created separately for each thematic group, which made it possible to reflect domain specifics. The results were compared with a classifier built for all documents without categorization. Relevant information for the automated processing of scholarly documents can also be obtained from external knowledge bases, which can be used to enrich the content of the documents. Therefore, a study was conducted on the impact of a selected knowledge base (DBpedia) on classification accuracy. This study analyzes only the effects of a domain-independent knowledge base; in the future, we plan to analyze the effects of domain-specific knowledge bases containing information for a particular field. An important aspect is the practical use of scientific information. The last part of the dissertation deals with the possibilities of incorporating the results of machine learning tasks into the Open Research Knowledge Graph, a knowledge graph describing scientific documents developed at the Leibniz Information Centre for Science and Technology.

    A flexible keyphrase extraction technique for academic literature

    A keyphrase extraction technique endeavors to extract quality keyphrases from a given document, which provide a high-level summary of that document. Except for statistical keyphrase extraction approaches, all other approaches are either domain-dependent or require a sufficient amount of training data, which are rare at present. Therefore, this paper proposes a new tree-based automatic keyphrase extraction technique, which is domain-independent, employs nominal statistical knowledge, and requires no training data. The proposed technique extracts a quality keyphrase by forming a tree from a candidate keyphrase; the tree is later expanded, shrunk, or left unchanged depending on other similar candidate keyphrases. Finally, keyphrases are extracted from the resultant trees based on a threshold value (the Maturity Index (MI) of a node in the tree), which makes the process flexible. A small MI value yields many and/or lengthy keyphrases (greedy approach), whereas a large MI value yields fewer and/or shorter keyphrases (conservative approach). A user can therefore extract the desired level of keyphrases by tuning the MI value. The effectiveness of the proposed technique is evaluated on an actual corpus and compared with the Rapid Automatic Keyphrase Extraction (RAKE) technique.

    A Highly Accurate PDF-To-Text Conversion System for Academic Papers Using Natural Language Processing Approach

    Extracting text from PDF documents is never an easy task when a high degree of accuracy and consistency are the two main criteria to be met. Although a considerable number of such systems exist, most of them fall short of offering desirable performance, especially where academic literature is concerned. Researchers who are heavily involved in text mining and project analysis need an accurate and consistent supporting tool for PDF-To-Text (PTT) conversion. Therefore, in this paper, we propose a Natural Language Processing based PDF-to-text (NLPDF) conversion system, which comprises two major steps, namely (i) reading content from the PDF and (ii) reconstructing the text. The performance of the proposed system is evaluated via four metrics, namely Precision, Recall, F-Measure (AF), and standard deviation, and compared with eight other similar benchmarked systems available in the market. The experimental results demonstrate the effectiveness of the proposed system.

    Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph

    Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we describe our work that attempts to go beyond bibliometric metadata to predict influential scholarly documents. Furthermore, this work also examines the influential scholarly document prediction task over categorized scholarly documents. We also introduce a new approach that enhances the document representation method with a domain-independent knowledge graph to find influential scholarly documents using categorized scholarly content. As the input collection, we use the WHO corpus of scholarly documents on the theme of COVID-19. This study examines different document representation methods for machine learning, including TF-IDF, BOW, and embedding-based language models (BERT). The TF-IDF document representation method works better than the others. Among the machine learning methods tested, logistic regression outperformed the others for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction when a domain-independent knowledge graph, specifically DBpedia, was used to enhance the document representation for categorized scholarly content. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation method. We also enhance the BOW document representation with the direct type (RDF type) and unqualified relation from DBpedia. In this experiment, we did not find any impact of the enhanced document representation on scholarly document category classification. We did find an effect on influential scholarly document prediction with categorical data.
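
As a rough illustration of the TF-IDF document representation mentioned in this abstract, the Python sketch below uses one common textbook variant (relative term frequency times the natural log of N/df). The paper does not state which variant or library it used, so the function `tfidf_vectors` and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document.

    Assumption: tf = count / document length, idf = ln(N / df).
    Other variants (smoothing, normalization) give different values.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus of pre-tokenized documents.
docs = [["covid", "vaccine", "trial"], ["covid", "transmission"]]
vectors = tfidf_vectors(docs)
```

In this toy corpus "covid" occurs in every document and therefore receives zero weight, while terms unique to one document keep a positive weight.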